fix(gpu): add WSL2 GPU support via CDI mode by tyeth-ai-assisted · Pull Request #411 · NVIDIA/OpenShell

tyeth-ai-assisted · 2026-03-17T22:03:25Z

Summary

Detect WSL2 at gateway startup (/dev/dxg present) and automatically configure CDI-based GPU injection
Fixes the complete nvidia-device-plugin failure chain on WSL2: NFD can't see PCI, NVML can't init without libdxcore.so, CDI spec missing per-GPU UUID entries
All changes are in cluster-entrypoint.sh — no Rust, Dockerfile, or manifest changes needed

What it does

When GPU_ENABLED=true and /dev/dxg exists (WSL2), the entrypoint:

Generates CDI spec via nvidia-ctk cdi generate (auto-detects WSL mode)
Adds per-GPU UUID and index device entries (nvidia-ctk only generates name=all, but the device plugin assigns GPUs by UUID)
Bumps CDI spec version from 0.3.0 to 0.5.0 (library minimum)
Patches the spec to include libdxcore.so (upstream nvidia-ctk bug: nvidia-container-toolkit#1739, "nvidia-ctk cdi generate: libdxcore.so not found on WSL2 despite being present")
Switches nvidia-container-runtime from auto to cdi mode
Deploys a k3s Job to label the node with pci-10de.present=true (NFD can't detect NVIDIA PCI on WSL2's virtualised bus)
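
The version bump and per-GPU entries in the steps above can be sketched roughly as follows. This is an illustration, not the code from cluster-entrypoint.sh: the function names are hypothetical, and the sketch assumes the WSL-mode spec has a top-level `devices:` key whose only device node is /dev/dxg. The libdxcore.so patch is not covered here.

```shell
#!/bin/sh
# Hypothetical helpers; names and spec layout are assumptions for illustration.

# bump_cdi_version SPEC: rewrite the top-level cdiVersion field to 0.5.0.
bump_cdi_version() {
    sed 's/^cdiVersion:.*/cdiVersion: 0.5.0/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# render_gpu_entries: read "index, uuid" lines on stdin (the format printed by
# `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`) and print one CDI
# device entry per index and one per UUID. Assumes every GPU is reached
# through /dev/dxg, as on WSL2.
render_gpu_entries() {
    while IFS=', ' read -r index uuid; do
        [ -n "$index" ] || continue
        for name in "$index" "$uuid"; do
            printf '%s\n' \
                "- name: \"$name\"" \
                "  containerEdits:" \
                "    deviceNodes:" \
                "    - path: /dev/dxg"
        done
    done
}

# add_gpu_entries SPEC ENTRIES_FILE: insert the rendered entries directly
# after the top-level "devices:" line, ahead of the existing name=all entry.
add_gpu_entries() {
    sed "/^devices:/r $2" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example wiring (commented out; the spec path is an assumption):
#   spec=/var/run/cdi/nvidia.yaml
#   nvidia-smi --query-gpu=index,uuid --format=csv,noheader \
#       | render_gpu_entries > /tmp/gpu-entries.yaml
#   bump_cdi_version "$spec"
#   add_gpu_entries "$spec" /tmp/gpu-entries.yaml
```

Text-based patching like this is fragile if the layout of the generated spec changes, so a YAML-aware tool may be a safer choice where one is available.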

On non-WSL2 hosts, the new code path is never entered (/dev/dxg doesn't exist).
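
A minimal sketch of that gate, assuming a hypothetical helper name and a `DXG_DEVICE` override that exists only to make the check testable (the PR itself checks /dev/dxg):

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when GPU support is requested and the
# WSL2 GPU device node exists. DXG_DEVICE is an assumed test hook.
should_configure_wsl2_gpu() {
    dxg_device="${DXG_DEVICE:-/dev/dxg}"
    [ "${GPU_ENABLED:-false}" = "true" ] && [ -e "$dxg_device" ]
}

if should_configure_wsl2_gpu; then
    echo "WSL2 GPU detected; configuring CDI mode"
else
    echo "Skipping WSL2 GPU setup"
fi
```

Because both conditions must hold, a native Linux host with GPU_ENABLED=true still skips the branch.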

Testing

Verified on:

Hardware: Framework 16 laptop, AMD CPU, NVIDIA RTX 5070 (8GB VRAM) + 96GB DDR5 shared
OS: WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
Driver: NVIDIA 595.71, CUDA 13.2
Result: nvidia-device-plugin 1/1 Running, nvidia.com/gpu: 1 advertised, nvidia-smi works inside sandbox pods, full NemoClaw onboard + sandbox creation + local inference (ollama nemotron 70B) working end-to-end

Related

Closes #404 (GPU passthrough fails on WSL2: NVML init fails without CDI mode and libdxcore.so)
Upstream bug: nvidia-container-toolkit#1739 (nvidia-ctk cdi generate misses libdxcore.so on WSL2)
Related to #398 (feat: Use CDI for GPU injection instead of nvidia-container-cli). WSL2 is a concrete platform where legacy injection is broken and CDI is the only viable path.

Agent Investigation

Diagnosed using openshell doctor commands. Full diagnostic chain documented in #404.

🤖 Generated with Claude Code

…chart

WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia* device nodes, which breaks the entire NVIDIA k8s device plugin detection chain. Three changes fix this:

1. Detect WSL2 in cluster-entrypoint.sh and configure CDI mode:
   - Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
   - Patch the spec to include libdxcore.so (nvidia-ctk bug omits it)
   - Switch nvidia-container-runtime from auto to cdi mode
   - Deploy a job to label the node with pci-10de.present=true (NFD can't see NVIDIA PCI on WSL2's virtualised bus)
2. Bundle the nvidia-device-plugin Helm chart in the cluster image instead of fetching from the upstream GitHub Pages repo at startup. The repo URL (nvidia.github.io/k8s-device-plugin/index.yaml) currently returns 404.
3. Update the HelmChart CR to reference the bundled local chart tarball via the k3s static charts API endpoint.

Closes NVIDIA#404
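
The runtime mode switch in item 1 could look roughly like this. The function name is hypothetical, and the sketch assumes the config file holds a single `mode = "auto"` line, as in the default /etc/nvidia-container-runtime/config.toml; the actual entrypoint may do this differently.

```shell
#!/bin/sh
# Hypothetical helper: flip nvidia-container-runtime from auto to cdi mode by
# rewriting the mode key in the given config.toml. Other keys are untouched.
set_runtime_mode_cdi() {
    config="$1"
    sed 's/^\([[:space:]]*mode[[:space:]]*=[[:space:]]*\)"auto"/\1"cdi"/' \
        "$config" > "$config.tmp" && mv "$config.tmp" "$config"
}

# Example (commented out; path is the toolkit's usual default location):
#   set_runtime_mode_cdi /etc/nvidia-container-runtime/config.toml
```

The substitution only matches a value of "auto", so a config already set to another mode is left as it is.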

The upstream Helm repo URL works fine; remove the unnecessary chart bundling and local reference changes.

WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia* device nodes, which breaks the entire NVIDIA k8s device plugin detection chain. This patch detects WSL2 at container startup and applies fixes:

1. Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
2. Add per-GPU UUID and index device entries to CDI spec (nvidia-ctk only generates name=all but the device plugin assigns GPUs by UUID)
3. Bump CDI spec version from 0.3.0 to 0.5.0 (library minimum)
4. Patch the spec to include libdxcore.so (nvidia-ctk bug omits it; this library bridges Linux NVML to the Windows DirectX GPU Kernel)
5. Switch nvidia-container-runtime from auto to cdi mode
6. Deploy a job to label the node with pci-10de.present=true (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

Closes NVIDIA#404
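
For the node-labelling step (item 6), a Job manifest might be rendered along these lines. Everything named here is an assumption for illustration: the Job and ServiceAccount names, the image argument, and the full label key `feature.node.kubernetes.io/pci-10de.present` (the commit message gives only `pci-10de.present=true`). The ServiceAccount is assumed to already exist with permission to patch nodes.

```shell
#!/bin/sh
# Hypothetical helper: print a Job manifest that labels NODE_NAME so the
# device plugin's node selection matches on WSL2. IMAGE must provide kubectl.
render_label_job() {
    node_name="$1"
    image="$2"
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: wsl2-gpu-node-label
  namespace: kube-system
spec:
  template:
    spec:
      serviceAccountName: wsl2-gpu-node-labeler
      restartPolicy: OnFailure
      containers:
      - name: label
        image: ${image}
        command:
        - kubectl
        - label
        - node
        - ${node_name}
        - feature.node.kubernetes.io/pci-10de.present=true
        - --overwrite
EOF
}

# Example (commented out): k3s applies manifests placed in its server
# manifests directory, so the output could be written there.
#   render_label_job "$(hostname)" example.invalid/kubectl:latest \
#       > /var/lib/rancher/k3s/server/manifests/wsl2-gpu-node-label.yaml
```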

github-actions · 2026-03-17T22:03:34Z

Thank you for your interest in contributing to OpenShell, @tyeth-ai-assisted.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

Open a Vouch Request discussion.
Describe what you want to change and why.
Write in your own words — do not have an AI generate the request.
A maintainer will comment /vouch if approved.
Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

github-actions · 2026-03-17T22:03:34Z

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:

I have read the DCO document and I hereby sign the DCO.

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

tyeth · 2026-03-17T22:05:47Z

I have read the DCO document and I hereby sign the DCO.

tyeth-ai-assisted · 2026-03-17T22:06:04Z

I have read the DCO document and I hereby sign the DCO.

WSL2 GPU support:
- Add wsl2-gpu-fix.sh that applies CDI mode, libdxcore.so injection, and node labeling after gateway start (workaround until OpenShell ships native WSL2 support via NVIDIA/OpenShell#411)
- Hook it into both onboard.js (interactive wizard) and setup.sh (legacy script) so it runs automatically after gateway creation
- Writes a complete CDI spec from scratch instead of fragile sed patching of the nvidia-ctk generated spec

Ollama on Linux:
- setup.sh only created the ollama-local provider on macOS (Darwin)
- Now detects ollama on any platform (Linux/WSL2 included)
- Enables local GPU inference via ollama for WSL2 users

Closes NVIDIA/NemoClaw#TBD
See also: NVIDIA/OpenShell#404, NVIDIA/OpenShell#411
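
The platform-independent ollama check described above might be as simple as the following sketch. The helper name is hypothetical, and the provider-creation step is only echoed because the real command is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: detect ollama by looking it up on PATH rather than by
# checking whether the OS is Darwin, so Linux and WSL2 hosts also qualify.
has_ollama() {
    command -v ollama >/dev/null 2>&1
}

if has_ollama; then
    echo "ollama found; the ollama-local provider can be created"
else
    echo "ollama not found; skipping ollama-local provider"
fi
```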

tyeth added 3 commits March 17, 2026 20:09

fix(gpu): revert helm chart bundling, keep only WSL2 CDI fix

af1ae24


tyeth-ai-assisted requested a review from a team as a code owner March 17, 2026 22:03

github-actions bot closed this Mar 17, 2026

tyeth-ai-assisted mentioned this pull request Mar 17, 2026

[BUG] nemoclaw onboard forces --gpu on WSL2, sandbox DOA (workaround included) NVIDIA/NemoClaw#208

Closed

tyeth mentioned this pull request Mar 17, 2026

bug: GPU passthrough fails on WSL2 — NVML init fails without CDI mode and libdxcore.so #404

Open

tyeth-ai-assisted mentioned this pull request Mar 17, 2026

fix(gpu): add WSL2 GPU support and ollama provider on Linux NVIDIA/NemoClaw#254

Draft

This was referenced Mar 18, 2026

feat: Use CDI for GPU injection instead of nvidia-container-cli #398

Open

fix(gpu): add WSL2 GPU support via CDI mode #441

Closed

tyeth-ai-assisted wants to merge 3 commits into NVIDIA:main from tyeth-ai-assisted:fix/wsl2-gpu-support
